This markdown documents the journey of cleaning even a small amount of TEROS data for the TEMPEST 2 flooding events. Initial attempts to clean these data ran into too many decision points, hence the need to carefully document and justify how we are deciding to clean them. Ideally, this will help with writing the associated methods and explaining decisions to co-authors.
Looking at all the TEROS data from the time-period of interest, it’s clear that there’s a lot going on, including non-responsive sensors, high intra-plot variability, and different responses to the flooding events.
There are two immediate issues with the 5-minute data: 1) static sensors with high VWC (> 0.75), and 2) many of the sensors are missing data for some or most of the time-period of interest. This is a bummer, since the 5-minute data temporally matches our DO and redox datasets.
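To quantify that second issue, a quick coverage check works: count how many distinct timestamps each sensor reports, relative to the number expected over the window. This is a sketch on synthetic data; the column names (`sensor_id`, `timestamp`) are placeholders, not our actual schema.

```python
import pandas as pd

# Expected 5-minute timestamps over a (hypothetical) one-day window
window = pd.date_range("2022-06-01", "2022-06-02", freq="5min")

# Synthetic readings: sensor A reports the full window, sensor B only half
readings = pd.DataFrame({
    "sensor_id": ["A"] * len(window) + ["B"] * (len(window) // 2),
    "timestamp": list(window) + list(window[: len(window) // 2]),
})

# Fraction of expected timestamps each sensor actually reported
coverage = (
    readings.groupby("sensor_id")["timestamp"]
    .nunique()
    .div(len(window))
    .rename("fraction_present")
)
print(coverage)
```

Sensors with low coverage here are the ones missing data for "some or most" of the window.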
The 15-minute data looks generally better in terms of continuity across the time-frame of interest, but has the same issue with flatlined sensors.
The first step is the easiest: scrub the high-VWC sensors. Conveniently, the flatlined sensors all sit at exactly 0.75, and none of the other sensors reach that value, so we can easily trim them out. While we're at it, let's make plots for the other two variables as well, to see if they need initial cleaning:
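The scrub can be as simple as flagging any sensor that ever reports the bad value, then dropping all of its records. A minimal sketch on a synthetic frame (column names are hypothetical):

```python
import pandas as pd

# Synthetic VWC readings; sensor B is flatlined at the bad value
df = pd.DataFrame({
    "sensor_id": ["A", "A", "B", "B", "C", "C"],
    "vwc":       [0.31, 0.33, 0.75, 0.75, 0.28, 0.29],
})

BAD_VWC = 0.75  # the flatlined value; nothing else reaches it

# Identify offending sensors, then drop every record from them
bad_sensors = df.loc[df["vwc"] == BAD_VWC, "sensor_id"].unique()
clean = df[~df["sensor_id"].isin(bad_sensors)]
print(sorted(bad_sensors))
```

Dropping the whole sensor (rather than just the 0.75 rows) matches the intent here: a flatlined sensor isn't trustworthy at other timestamps either.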
We have one rogue sensor reading implausibly high temperatures (>30 °C), which doesn't make sense and is a clear outlier. Let's remove that sensor as well.
There are a couple potential routes here:
If the 15-minute and 5-minute data are comparable, then either 3 or 4 should be used. If they aren't, we should use 1. So, let's compare them:
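One way to run that comparison: pair the two cadences on their shared timestamps (every 15-minute timestamp also exists in the 5-minute record) and look at the differences. Sketch on synthetic data, with hypothetical column names:

```python
import pandas as pd
import numpy as np

# Synthetic 5-minute record: 36 values over 3 hours
t5 = pd.date_range("2022-06-09", periods=36, freq="5min")
five = pd.DataFrame({"timestamp": t5, "vwc_5min": np.linspace(0.30, 0.35, 36)})

# Synthetic 15-minute record over the same span (every 3rd 5-min value)
t15 = pd.date_range("2022-06-09", periods=12, freq="15min")
fifteen = pd.DataFrame({"timestamp": t15,
                        "vwc_15min": np.linspace(0.30, 0.35, 36)[::3]})

# Pair the records on shared timestamps and compare
paired = five.merge(fifteen, on="timestamp", how="inner")
diff = (paired["vwc_5min"] - paired["vwc_15min"]).abs()
print(diff.max())  # near zero when the two records agree
```

If the paired differences are negligible, the two cadences are interchangeable where they overlap.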
I’m going to go with #3 above, because the 15-minute record is complete: even though a given timestamp should yield the same value at either cadence, we will keep all 15-minute records where possible (and thus the full time-frame) and fill in 5-minute values as available. My main concern is that we will now have unequal sample sizes for different periods of the same length (e.g. on 6/3, with 5-minute data missing, a day will have 4 × 24 = 96 values, while 6/9 will have 12 × 24 = 288 values). This likely doesn’t matter much since we have such high sample sizes across our datasets, but it’s something to keep in mind. Let’s revisualize our newly merged dataset.
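The merge described above can be sketched as: take all 15-minute records, then append only the 5-minute records at timestamps the 15-minute set lacks. Synthetic data below (dates borrowed from the example in the text, constant VWC values for simplicity, placeholder column names):

```python
import pandas as pd

# 15-minute records for two days; 5-minute data exists only for 6/9
d1_15 = pd.DataFrame({
    "timestamp": pd.date_range("2022-06-03", periods=96, freq="15min"),
    "vwc": 0.30,
})
d2_15 = pd.DataFrame({
    "timestamp": pd.date_range("2022-06-09", periods=96, freq="15min"),
    "vwc": 0.32,
})
d2_5 = pd.DataFrame({
    "timestamp": pd.date_range("2022-06-09", periods=288, freq="5min"),
    "vwc": 0.32,
})

# Keep every 15-minute record; add 5-minute rows only at new timestamps
base = pd.concat([d1_15, d2_15])
extra = d2_5[~d2_5["timestamp"].isin(base["timestamp"])]
merged = pd.concat([base, extra]).sort_values("timestamp")

# Per-day counts show the unequal sample sizes noted above
counts = merged.set_index("timestamp").resample("D")["vwc"].count()
print(counts)  # 96 values on 6/3, 288 on 6/9
```

This reproduces the 96-vs-288 asymmetry flagged in the text, which is the trade-off we're accepting for a complete time-frame.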